Least Dependent Component Analysis Based on Mutual Information 
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We propose to use precise estimators of mutual information (MI) to find least dependent compo- 
nents in a linearly mixed signal. On the one hand this seems to lead to better blind source separation 
than with any other presently available algorithm. On the other hand it has the advantage, com- 
pared to other implementations of 'independent' component analysis (ICA) some of which are based 
on crude approximations for MI, that the numerical values of the MI can be used for: 

(i) estimating residual dependencies between the output components; 

(ii) estimating the reliability of the output, by comparing the pairwise Mis with those of re-mixed 
components; 

(iii) clustering the output according to the residual interdependencies. 

For the MI estimator we use a recently proposed fc-nearest neighbor based algorithm. For time 
sequences we combine this with delay embedding, in order to take into account non-trivial time 
correlations. After several tests with artificial data, we apply the resulting MILCA (Mutual Infor- 
mation based Least dependent Component Analysis) algorithm to a real- world dataset, the ECG of 
a pregnant woman. 



I. INTRODUCTION 

'Independent' component analysis (ICA) is a statistical 
method for transforming an observed multi-component 
data set x(f) = (xi(t),X2(t), ...,x n (t)) into components 
that are statistically as independent from each other as 
possible Q ■ In theoretical analyses one usually assumes a 
certain model for the data for which a decomposition into 
completely independent components is possible, but in 
real life applications the latter will in general not be true. 
Depending on the assumed structure of the data, one 
typically makes a parametrized guess about how they can 
be decomposed (linearly or not, using only equal times 
or using also delayed superpositions, etc.) and then fixes 
the parameters by minimizing some similarity measure 
between the output components. 

Using mutual information (MI) would be the most nat- 
ural way to solve this problem. But estimating MI from 
statistical samples is not easy. Most existing algorithms 
are either very slow or very crude. Also, the more so- 
phisticated estimates usually do not depend smoothly on 
transformations of the data, which slows down minimum 
searches. In the ICA literature mostly very crude approx- 
imations of MI are used, or MI is completely disregarded 
in favor of different approaches [ij, |2( . In particular, we 
are aware of only very few attempts to pay attention to 
the actual values of the similarities / (in)dependences 
obtained by ICA. Of course it has been recognized sev- 
eral times that even the best decomposition with a given 
class of algorithms (e.g. linear and instantaneous) may 
not lead to strictly independent components, but then 
typically it is proposed to use a decomposition algorithm 
within this class which is different from that for truly 
independent sources 3, ||. An exception is the 'multidi- 
mensional ICA' of jj where the author points out that 
one can use standard decomposition algorithms even in 
case of non-zero dependencies, but also there most of the 
attention is payed to whether components are indepen- 



dent, but not on how dependent they are. The latter can 
be useful for clustering the output, but also for reliability 
and stability testing: A blind source separation into inde- 
pendent components will be the more robust, the deeper 
are the minima of the dependences. In 0,013 such relia- 
bility tests have been proposed based on resampling and 
noise injection. We believe that looking at the depen- 
dence landscape is more direct and conceptually simpler. 

In the present paper we propose to use a recently intro- 
duced MI estimator based on fc-nearest neighbor statis- 
tics H . It resembles the Vasicek estimator [l(| for differ- 
ential entropies which has been applied recently to ICA 
[Til and which is also based on fc-nearest neighbor 
statistics. But while the Vasicek estimator exists only foi- 
l-dimensional distributions and can not therefore be used 
to estimate dependencies via ML our estimator is based 
on the Kozachenko-Leonenko estimator for differen- 
tial entropies and works in any dimension. In addition, 
it seems to give the most precise blind source separation 
algorithm for 2-d distributions known at present. 

Throughout the paper we will only discuss the simplest 
case of linear superpositions. While MI can be applied in 
principle also to nonlinear mixtures, this would be much 
more difficult. 

The paper is organized as follows. In Sec. II we recall 
basic properties of MI and present the MI estimator of 
0. The basic version of MILCA is described in Sec. Ill 
where we also give first applications to toy models, and 
where we will also discuss the reliability of the decompo- 
sitions. In Sec. IV we deal with the case where only some 
groups of output components are independent, with non- 
zero interdependencies within the groups. In this case it 
is natural to cluster the components. We propose to use 
again MI for that purpose, in the form of the mutual 
information based clustering (MIC) algorithm presented 
recently in 14] . In Sec. V we discuss how MILCA (and 
other ICA algorithms) can be combined with time delay 
embedding, in order to take into account non-trivial time 
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structure (in case the data to be decomposed form a time 
series). A thorough discussion of our method and of its 
relations to previous work is given in Sec. VI. Conclusions 
are drawn in the last section, Sec. VII. 



II. MUTUAL INFORMATION 



A. General Properties of MI 



Assume that X and Y are continuous random variables 
with joint density fi(x, y) and marginal densities u x (x) = 



J dy[i{x,y) and H v (y)- Then MI is defined as 



es /t. t 

m 



I(X,Y) 



dxdy y) log 



Hx(x)Hy(y) 



(1) 



The base of the logarithm determines the units in 
which information is measured. In the following, we will 
always use natural logarithms, i.e. mutual information 
will be measured in nats. 

In terms of the differential entropies 



H(X) 



dx^ x (x) \ogii x {x) 



H(Y) = - / dyny(y) log^(y), 



(2) 



(3) 



H (X, Y) = - dxdy fi(x,y) log fi(x, y) (4) 



and 



it can be written as I(X, Y) = H(X)+H(Y)-H(X,Y). 

The most important property of MI is that it is al- 
ways non-negative, and is zero if and only if X and Y 
are independent. Another important feature of MI is 
its invariancc under homcomorphisms of X and Y. If 
X 1 = F(X) and Y' = G(Y) are smooth and uniquely 
invertible maps, then 



I(X',Y')=I(X,Y). 



(5) 



Notice that this is not the case for differential entropies. 
Just as Gaussian distributions maximize the differential 
entropy, giving thereby an upper bound on the entropy 
in terms of the variance of the distribution, Gaussians 
minimize MI [9|. This gives a lower bound on MI in 
terms of the correlation coefficient 



(X.Y) 



[{X^iY 2 )] 1 / 2 



I(X,Y) > --log(l 



V). 



(0) 



(7) 



This might suggest that MI can be decomposed into a 
'linear' part [the r.h.s. of Eq.Q] plus a non-linear part. 



While such a decomposition is of course always possible, 
it is in general not useful. For example, it would also 
suggest that the minimum of MI under linear transfor- 
mations (X',Y r ) = A (X, Y) is always reached when X 1 
and Y' are linearly uncorrelated (in which case r = and 
the r.h.s. of Eq.Q is zero). But it is easy to give coun- 
terexamples for which this is not true (see appendix). 

This is important for MILCA, since it is standard prac- 
tice in ICA to make first a "prewhitening" (principle 
component transformation plus rescaling, so that the co- 
variance matrix is isotropic), and to restrict the actual 
minimization of the contrast function to pure rotations 
0. If one is sure that the sources are really independent, 
this is justified: For the correct sources both Mis and 
covariances are zero. But it is not justified, if there are 
no strictly independent sources and we want to find the 
least dependent sources. 

For any number M of random variables, the MI (or 
'redundancy', as it is often called) is defined as 

M 

I(Xi,X 2 , ■ ■ . Xm) = ^ H(X m ) — H(Xi,X 2 , ■ ■ ■ Xm)- 

m—l 

(8) 

Notice that this is the appropriate definition for ICA or 
MILCA, since it is this difference which one wants to 
minimize. In the literature outside the ICA community 
usually a different construct is called MI 0] , but we shall 
in the following only use Eq.(|HJ|. 

The M-dimensional MI shares with I(X, Y) the invari- 
ance under homeomorphisms for each X rn , and the fact 
that it is bounded by the value obtained for a Gaussian 
with the same covariance matrix |SB_ The next important 
property is the grouping property 



I(X,Y,Z)=I((X,Y),Z)+I(X,Y) 



(9) 



Here, I((X,Y), Z) is the MI between the two variables 
Z and (X, Y) , and we have used the fact that a ran- 
dom variable need not be a scalar. Indeed, anything we 
said so far holds also if X, Y, . . . are multicomponent ran- 
dom variables (except Eq.JJJ which has to be suitably 
modified). Therefore, if we have more than 3 random 
variables, Eq.© can be iterated. For any set of random 
variables and any hierarchical clustering of this set into 
disjoint groups, the total MI can be hierarchically de- 
composed into Mis between groups and Mis within each 
group. This will become important in Sec.V where we 
discuss clustering based on MI. 

Intuitively, one might expect that I{X, Y, Z) — if 
I{X, Y) = I[X, Z) = I{Y, Z) = 0. Pairwise strict inde- 
pendence would then imply global independence. That 
this is not true is demonstrated in the appendix with 
a simple counter example. It becomes important for 
chaotic deterministic systems. If xi,x 2 , ■ ■ - Xn is a uni- 
variate signal produced by a strange attractor with di- 
mension d, then any d-tuple of consecutive xt values will 
be weakly dependent, while any m-tuple with m > d will 
be strongly dependent. 
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The last property to be discussed here is related to 
homeomorphisms involving a pair of variables (X, Y), i.e. 
(X',Y') — F(X,Y). Using the grouping property and 
the invariance under homeomorphisms of a single variable 
we obtain |9j 

I(X', Y',Z,...) = I{X, Y,Z,...)+ [I(X', Y') - I(X, Y)] . 

, . (10) 

This is important if we want to minimize the MI with re- 
spect to linear transformations. Since any such transfor- 
mation in M dimensions can be factorized into pairwise 
transformations, this means that we only have to com- 
pute pairwise Mis for the minimization. To find the ac- 
tual value of the minimum, we have of course to perform 
one calculation in all M dimensions. We also have to 
estimate higher order Mis directly, if we want to use the 
method of Sec. V.A with embedding dimension m > 2. 

B. MI Estimation 

Assume that one has a set of N bivariate measure- 
ments, (xi, yi), i = 1, . . . N, which are assumed to be iid 
(independent identically distributed) realizations of the 
random variable Z = (X, Y). Our task is to estimate 
MI, with or without explicit estimation of the unknown 
densities n(x, y), [i x (x), and H y {y)- 

Two classes of estimators were given in 9]- In con- 
trast to other estimators based on cumulant expansions, 
entropy maximalization, parameterizations of the densi- 
ties, kernel density estimators or binnings (for a review 
of these methods see 9] ) , the algorithms proposed in [i| 
are based on entropy estimates from fc-nearest neighbor 
distances. This implies that they are data efficient (with 
k = 1 we resolve structures down to the smallest possi- 
ble scales) , adaptive (the resolution is higher where data 
are more numerous), and have minimal bias. Numer- 
ically, they seem to become exact for independent dis- 
tributions, i.e. the estimators are completely unbiased 
(and therefore vanish except for statistical fluctuations) 
if /z(x, y) = n(x)n(y). This was found for all tested distri- 
butions and for all dimensions of x and y. It is of course 
particularly useful for an application where we just want 
to test for independence. 

In the following we shall discuss only one of these 
two classes, the one based on rectangular neighborhoods 
called i^(X,Y) in @. 

C. Formal Developments 

We will start from the Kozachenko-Leonenko estimate 
for Shannon entropy H [H [H d [3 : 

d N 

H{X) = -^(fc) + 1> (N) + logQ + - log c(i) (11) 

i=l 

where ip(x) is the digamma function, e(i) is twice the dis- 
tance from Xi to its fc-th neighbor, d is the dimension of x 



and Cd is the volume of the d-dimensional unit ball. Mu- 
tual information could be obtained by estimating H(X), 
H(Y) and H(X,Y) separately and using 

I(X,Y) = H(X) + H(Y)-H(X,Y) . (12) 

But for any fixed fc, the distance to the fc-th neighbor in 
the joint space will be larger than the distances to the 
neighbors in the marginal spaces. Since the bias from the 
non-uniformity of the density depends of course on these 
distances, the biases in H(X), H(Y), and in H(X,Y) 
would not cancel. 

To avoid this, we notice that Ea. l|ll|l holds for any 
value of fc, and that we do not have to choose a fixed 
fc when estimating the marginal entropies (this idea was 
used first, somewhat less systematically, in 19]). So let us 
denote by e x (i) and e y (i) the edge lengths of the smallest 
rectangle around point i containing k neighbors, and let 
7i x {i) and n y (i) (the number of points with ||izii — Xj\\ < 
e x (i)/2 and \\yi — yj\\ < e y (i)/2) as the new number of 
neighbors in the marginal space. The estimate for MI is 
then: 

I(X, Y) = i>(k) - 1/fc - Wn x ) + V>K)} + ip(N). (13) 

We denote by (...) averages both over all i 6 [1, ... TV] 
and over all realizations of the random samples. 

Here we will show results of I(X, Y) for Gaussian dis- 
tributions (cf. Fig.nj. Let X and Y be Gaussian signals 
with zero mean and unit variance, and with covariance 
r. In this case I(X, Y) is known exactly, 

I Gauss (X,Y) = -^\og(l-r 2 ) (14) 

Apart from the fact that indeed I(X, Y) — Iq^s^X , Y) — > 
for N — > oo, the most conspicuous feature is that the 
systematic error is compatible with zero for r = 0. This is 
a property which makes the estimator particular interest- 
ing for ICA because there we are looking for uncorrelated 
signals. For non-Gaussian signals, our estimator still has 
a smaller systematic error than other estimators in the 
literature 

Using the same arguments for n random variables 
Aij_ X 2 , . . . X m , the MI estimate for I(Xi, X 2 , ■ ■ ■ X m ), 
is ||: 

I(X 1 ,X 2 ,...X m ) = ^(fc) - (m - l)/fc + (m - l)V(iV) 
- (ip(n xi ) + ip(n X2 ) + ...ip(n Xm )) 



D. Practical considerations 

By choosing proper values for fc, the algorithm allows 
to minimize either the statistical or the systematic errors. 
The higher is fc, the lower is the statistical error of I. 
The systematic error shows exactly the opposite behav- 
ior. Thus, to keep the balance between these two errors, 
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0.01 0.02 0.03 0.04 0.05 

1 /N 

FIG. 1: Estimates of average values of I(X, Y) — /exact (A, Y) 
for Gaussian signals with unit variance and covariances r = 
0.9,0.6,0.3, and 0.0 (from top to bottom), plotted against 
1/N. In all cases k = 1. The number of realizations is > 2 x 
10 6 for TV <= 1000, and decreases to « 10 s for N = 40, 000. 
Error bars are smaller than the sizes of the symbols. 



the best choice for k would lie in the middle range. But 
for some cases it makes sense to deviate from this, e.g. 
when we want to find most independent signal sources. 
There the true values of the MI are small, and thus also 
the systematic errors for all k. In this case it is better 
to use large k in order to reduce statistical errors. On 
the other hand, when the data files are very long we do 
not have to worry about statistical errors and we should 
choose k small. 

Most of the CPU time for estimating MI with our new 
estimator is used for neighbor searching. In Q we pre- 
sented three implementations which ranged from very 
simple but slow to sophisticated and fast. In the fol- 
lowing we shall always use the fastest implementation 
which uses grids to achieve a CPU time ~ N log N for N 
points. We will not use rank ordering (as also discussed 
in [9j), but we will add small Gaussian jiggles (amplitude 
« 1CU 8 ) to all measured values in order to break any 
degeneracies due to quantization in the analog-to-digital 
conversion 



III. MILCA WITHOUT USING TEMPORAL 
STRUCTURES 

A. Basic Algorithm 

In this Subsection we will show how the linear instan- 
taneous ICA problem is solved using the new MI estima- 
tor. We will apply this then to several artificial data sets 
which are constructed by superimposing known indepen- 
dent sources, and we will compare the results with those 
from several other ICA algorithms. 

In the simplest case x(f) is an instantaneous lin- 



ear superposition of n independent sources s(t) — 
(s 1 (t),s 2 (t),...,s 1l (t)), 

x(i)=As(i), (15) 

where A is a non-singular n x n 'mixing matrix'. This 
means that the number of sources is equal to the number 
of measured components. In this case, we know that a 
decomposition into independent components is possible, 
since the inverse transformation 

s(i) = Wx(t) with W = A" 1 (16) 

does exactly this. If Ea.Q15|) does not hold then no de- 
composition into strictly independent components is pos- 
sible by a linear transformation like Eq. I|16(l , but one can 
still search for least dependent components. 

But even if Ea. (|15fl does hold, the problem of blind 
source separation (BSS), i.e., of finding the matrix W 
without explicitly knowing A, is not trivial. Basically, 
it requires that x is such that the components of any 
superposition s' = W'x with W' ^ W are not indepen- 
dent. Since linear combinations of Gaussian variables are 
also Gaussian, BSS is possible only if the sources are not 
Gaussian. Otherwise, any rotation (orthogonal transfor- 
mation) s' = Rs would again lead to independent com- 
ponents, and the original sources s could not be uniquely 
recovered. Since any ICA algorithm will find a more or 
less meaningful solution, we need a reliability test for the 
obtained components. This is given in subsection C. 

As a first step, the matrix W is usually decomposed 
into two factors, W = RV, where the prewhitening V 
transforms the covariance matrix into C = VCV T = 1, 
and R is a pure rotation. Prewhitening is just a princi- 
pal component analysis (PCA) together with a rescaling. 
The ICA problem proper reduces then to finding a suit- 
able rotation for the prewhitcned data. 

The motivation for this is that any reasonable contrast 
function used for the ICA will give least dependent com- 
ponents which are also uncorrelated. In Sec. II we have 
seen that this is not always the case, but that it is true 
whenever the components are really independent. One 
can take now several different attitudes. The most rad- 
ical is to abandon prewhitening altogether (for different 
reasons for not to use prewhitening, see 20] ). But this 
slows down the algorithm considerably. Also, prewhiten- 
ing can be detrimental only when there are residual de- 
pendencies between the optimal components, and it is 
not clear what is the significance of such components. In 
the following we shall always use prewhitening unless we 
say explicitly the opposite. We shall always assume that 
the prewhitening step has already been done, and we will 
restrict the proper ICA (or rather LCA) transformations 
to pure rotations. As a third alternative one could first 
use prewhitening, but try at the end whether some non- 
orthogonal transformations improve the results further. 
We have not yet studied this strategy. 

The aim of ICA is now to minimize I{X\ . . . X n ) un- 
der a pure rotation R. Any rotation can be represented 



5 



pdfs 


FastICA 


Jade 


Imax 


KCCA 


KGV 


RADICAL 


MILCA 


MILC A ( augmented) 


a 


4.4 


3.7 


1.8 


3.7 


3.0 


2.1 


2.7 


2.4 


b 


5.8 


4.1 


3.4 


3.7 


2.9 


2.7 


2.9 


2.5 


c 


2.3 


1.9 


2.0 


2.7 


2.4 


1.2 


1.5 


1.0 


d 


6.4 


6.1 


6.9 


7.1 


5.7 


5.3 


7.0 


4.3 


e 


4.9 


3.9 


3.2 


1.7 


1.5 


0.9 


0.9 


1.0 


f 


3.6 


2.7 


1.0 


1.7 


1.5 


1.0 


0.9 


0.9 


g 


1.8 


1.4 


0.6 


1.5 


1.4 


0.6 


0.6 


0.6 


h 


5.1 


4.1 


3.1 


4.6 


3.6 


3.7 


3.4 


3.3 


i 


10.0 


6.8 


7.8 


8.3 


6.4 


8.3 


7.9 


8.0 


j 


6.0 


4.5 


50.6 


1.4 


1.3 


0.8 


0.7 


0.8 


k 


5.8 


4.4 


4.2 


3.2 


2.8 


2.7 


2.4 


2.3 


1 


11.0 


8.3 


9.4 


4.9 


3.8 


4.2 


4.1 


3.3 


m 


3.9 


2.8 


3.9 


6.2 


4.7 


1.0 


1.0 


0.8 


n 


5.3 


3.9 


32.1 


7.1 


3.0 


1.8 


2.0 


1.6 


o 


4.4 


3.3 


4.1 


6.3 


4.5 


3.4 


3.4 


2.9 


P 


3.7 


2.9 


8.2 


3.6 


2.8 


1.1 


1.6 


1.2 


q 


19.0 


15.3 


43.3 


5.2 


3.6 


2.3 


2.9 


1.9 


r 


5.8 


4.3 


5.9 


4.1 


3.7 


3.2 


3.5 


2.7 


mean 


6.1 


4.7 


10.6 


4.3 


3.3 


2.6 


2.7 


2.3 



TABLE I: Performance indices (multiplied by 100) for two-component blind source separatiorytest problem (A). The results in 
the first 6 columns (FastICA, Jade, Imax, KCCA, KGV, and RADICAL) are taken from Ref. 131 . where also references to these 
algorithms are given and where the probability distribution functions (pdfs) 'a' - 'r' are defined. The last two columns show 
the results of MILCA, first in its simplest version (column 7) and then with data augmentation as proposed in 12] (column 8). 
Each performance index is an average over 100 replicas, each replica consisting of 1000 pairs of numbers drawn randomly from 
the pdfs. For MILCA we used k = 10, and we fitted 7(0) by Fourier sums with 3 (MILCA) and 5 terms (augmented MILCA), 
respectively. 
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FIG. 2: Test problem (B), consisting of 5 input channels. Left 
panel: Averaged performance index P err from the output of 
FastICA Q (parameters with lowest MI), JADE [24J, TD- 
SEP (same parameters as in |6j), and MILCA (k=30). Right 
panel: same as left side, but with total MI I (k=3) used as 
performance measure. 



FIG. 3: Test problem (C), with 7 input channels. Left panel: 
Averaged I(si . . . s n ) (k=3) from the output of FastICA 0] 
(parameters with lowest MI), JADE 0, TDSEP (same pa- 
rameters as in and MILCA (k=30). The horizontal line 
indicates the true MI of the input channels. Right panel: 
Pairwise MI estimates I between all channel combinations, 
for the MILCA output components shown in Fig. 0] (diagonal 
is set to zero). 



as a product of rotations which act only in some 2x2 
subspace, R = rj;,j R-ij (</»), where with 

Rij(0)(xi . . . Xi . . . Xj . . . x n ) = [x\ . . . x\ . . . x'j . . . x n ) x 'i — cos 4> x i + sm 4> x j> x 'j — ~ sin0 x\ + cos 4> x j ■ 

(17) (18) 
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For such a rotation one has (see Eg. llOj l) 

7(R y (0)X) - /(X) = I(Xl,X$) - I{X it Xj) , (19) 

i.e., the change of I{X\ . . . X n ) under any rotation can 
be computed by adding up changes of two- variable Mis. 
This is an important numerical simplification. 

To find the optimal angle in a given plane, we 
calculated Iij(4>) = I{Xl,X'.) for typically 150 different 
angles in the interval [0, 7r/2], fitted these values by typ- 
ically 3-15 Fourier components, and took then the mini- 
mum of the fit. The latter is useful because /(</>) is not 
smooth in <f>, for essentially the same reasons as discussed 
in 0. We also tried the augmentation proposed in 
to smoothen I(X' ,Y'). It worked equally well, by and 
large, as the Fourier filtering, but it was much slower. 

Now the resulting MILCA-algorithm can be summa- 
rized: 

1. Preprocess (center, filter, detrend, ...) and whiten 
the data; 

2. For each pair with i,j = 1 . . . n find the an- 
gle <f> which minimizes a smooth fit to Zy (0) = 

3. If I(X[ . . . X' n ) has not yet converged, go back to 
step 2. Else, §i = X[ are the estimates for the 
sources. 

The order of choosing the sequence of pairs in point 2 
is not essential. In our numerical simulations the conver- 
gence speed did not differ significantly whether we went 
through the pairs systematically or randomly. 



B. Numerical Examples and Performance Tests 

(A) As a first test we study the set of 18 problems 
proposed by Bach & Jordan [21j and studied also in 0] . 
Each problem corresponds to a 1-d probability distri- 
bution p(x). One thousand pairs of random numbers 
x and y, each drawn iid from p(x)p(jj), are mixed as 
x' = x cos 4> + y sin <f>, y' = —x sin <p + y cos <fi with random 
angle (j> common to all pairs (i.e., A is a pure rotation). 
Using MILCA, we obtained then the estimate A. This is 
repeated 100 times with different angles 4> and with dif- 
ferent random sets of pairs (x, y). To assess the quality 
of the estimator A (or, equivalently, of the back transfor- 
mation W = A -1 ), we use the Amari performance index 



p 



1 N 

2N ^ v 



\Pi. 



\Pi. 



' x max fe \p ik | max fe \p kj \ 



(20) 



where p tj = (A 1 A) ij . 

Results are given in Table 1 (column 'MILCA') and 
compared there to the results of previous algorithms 
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FIG. 4: Seven output channels of the MILCA algorithm, test 
problem (C). 



given in |T3 | . They are excellent in average and surpassed 
only by the RADICAL algorithm proposed in 0] which 
also uses an entropy estimate based on neighbor dis- 
tances, but for the differential Shannon entropies H(x') 
and H(y'). Another feature used in |12| is data aug- 
mentation: To obtain a more smooth dependence on the 
angle cf>, each data vector (x, y) is replaced by an R-tuple 
(with R = 30) of near-by points. The same augmenta- 
tion trick can be used also for MILCA, and improves the 
results for very similar reasons. Indeed, our results ob- 
tained with MILCA and with data augmentation, given 
in the last column of Table 1, are even better than those 
of RADICAL. In the following tests we did not use data 
augmentation, because it is rather time consuming. 
(B) As a second test we study an example taken from 

0. In involves five input sources (a sine wave, two dif- 
ferent speech signals [the first half of "Houston, we have 
aproblem" and "parental guidance is suggested" from 
|23], one white Gaussian noise, one uniformly distributed 
white noise) (5000 data points each) which are linearly 
mixed with a 5 x 5 matrix A to form five output sig- 
nals. In mixing these components, no time delay is used, 

1. e. the superpositions are strictly local in time. For this 
example it is possible to find the inverse transformation 
W = A -1 up to a permutation and up to scaling factors, 
because all sources are independent from each other and 
only one has a Gaussian distribution. To assess the qual- 
ity of this back transformation we again use the Amari 
performance index. 

The results obtained with two hundred different ran- 
dom mixtures of the sources (with uniformly distributed 
mixing matrices and with different realizations of the 
random channels for each mixture) are compared in the 
left panel of Fig. with three standard algorithms: Fas- 
tlCA Q, JADE |H, and TDSEP 0. We found that 
FastICA sometimes gets stuck in a local minimum, and 
runs differing only in the initial conditions can produce 
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different results. The error bars shown in Fig. [5] indi- 
cate the resulting uncertainty of the performance mea- 
sure, estimated from 20 realizations that differ only in 
initial conditions. The errors of JADE and FastICA are 
mainly due to their difficulty to separate one of the au- 
dio channels from the Gaussian noise. TDSEP is not 
able to decompose the two noise channels, since it is also 
not designed for this purpose (it uses time structures to 
separate signals). Very good results for all 200 mixtures 
are obtained by MILCA, although the audio signals are 
quit noisy and have nearly Gaussian distributions. The 
performance of JADE and FastICA compared to MILCA 
becomes better when the quality of the acoustic signals 
improves. 

In addition to the Amari index, another (more direct) 
way to judge the accuracy of the source estimates is to 
look at the estimated Mis. If and only if the sources were 
estimated correctly, the MI should be zero. In the follow- 
ing we propose to use both the matrix of pairwise esti- 
mators I(si, Sj) and the estimated total MI Z(si . . . s n ). 
The important advantage over the Amari index is that 
they can also be used when the exact sources are not 
known. Low values of the MI indicate that both the data 
is a mixture of independent components, and the separa- 
tion algorithm worked well in producing some indepen- 
dent components. Notice that it cannot be expected in 
general that the components found are identical to the 
sources, e.g., if some of them are Gaussians. In Fig. 
again MILCA shows the best performances. 

Notice the very big difference between FastlCA/JADE 
and TDSEP in the right panel of Fig. [2] which is much 
bigger than that measured with the Amari index. The 
first two have problems in separating one of the acous- 
tic signals (signal #4 in Fig. 0} from the Gaussian, be- 
cause it has a nearly Gaussian amplitude distribution, 
but for the same reason this is not punished by a large 
MI between the outputs (improved performance index 
see later in Fig. I14fl . TDSEP, using time information, 
has no problem with this, but cannot separate uniform 
from Gaussian noise - and is heavily punished for that by 
MI. In Sec. V we will show how to improve MILCA such 
that it can better separate components which have nearly 
Gaussian amplitude distributions but different time cor- 
relations. Using that improved MILCA will give much 
bigger performance difference with algorithms like Fas- 
tlCA/JADE. 

(C) Next we want to investigate the case where the 
decomposition is neither perfectly nor uniquely possible. 
Such an example can be constructed by simply adding 
one cosine with the same frequency as the sine and one 
more Gaussian channel to the last test case. This now 
violates the assumption of independent sources, because 
the sine and cosine are strongly dependent. The theoret- 
ical value for the MI would be infinite, but a numerical 
estimator from a finite data sample gives a finite value, 
in our case 7 (Si . . . S„) = 0.72 j2f|. But for this example, 
perfect blind source separation is impossible also because 
the two Gaussians are not uniquely decomposable. We 




FIG. 5: Performance index distributions over 7000 triples of 
three-component mixtures. For histograms (a),(b) the origi- 
nal spectra were decomposed, for (c)-(e) their second deriva- 
tives. 



want to know how an ICA algorithm performs in view of 
such problems. It should still be able to separate those 
components which can be separated. 

The total output MI is shown in the left panel of Fig. [3] 
We see that for all algorithms the MI is higher than 
the MI between the input channels, which serves essen- 
tially as a consistency test. The difference is smallest 
for MILCA. The Mis between all pairwise channel com- 
binations obtained with MILCA are shown in the right 
panel of Fig. [31 They show again that MILCA has done 
a perfect job: All components are independent except 
for those which should not be. MILCA output is shown 
directly in Fig. 21 Although we do not show the input, 
it is clear that the separation has been as successful as 
possible. 

(D) There are a number of blind source separation 
problems in the field of analytical spectroscopy where 
quantitative spectral analysis of chemical mixtures is for- 
mulated as multivariate curve resolution (for recent re- 
views see HI and as an ICA problem H H |H ) . 
Assuming Beer's law, the spectrum of a mixture of pure 
constituents with spectra Si(y) and concentrations Ai is 
x { v ) = Ei^» s i( I/ )- Given a set of N mixtures and N 
pure components, we can then write this in vector nota- 
tion as x(f) = As(v), analogous to Eci. (|15J) . The task is 
to obtain estimates s(v) for the pure components. This 
is the instantaneous linear ICA problem, except that in 
most applications of interest the spectral sources are not 
independent but have overlapping bands. This happens 
when chemical compounds in a mixture share several 
common or similar structural groups that demonstrate 
nearly the same spectral patterns. 

This difficulty makes mixture decomposition quite 
nontrivial for many BSS techniques used in chemomet- 
rics, unless interactive band selection (e.g. SIMPLISMA 
H3, IPC A HI, BTEM 34]) is employed to avoid using 
those parts of the signals where severe overlaps reduce 
the quality of decomposition. Such preprocessing made 
by hand is, of course, a bit of an art, because these un- 
safe bands can not be known a priori in a blind problem. 
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Since the focus here is rather on developing general pur- 
pose algorithms, we aim at using MILCA without inter- 
active preprocessing in order to estimate its pure overall 
efficiency in cases when residual dependencies play a role. 

To test the performance of MILCA on typical spectral 
data we collected a pool of 62 experimental molecular 
infrared absorption spectra in the range 550-3830 cm -1 
(822 data points each) taken from the NIST database 
[35| . This test set was selected to contain organic com- 
pounds with common structural groups (benzene deriva- 
tives, phenols, alcohols, thiols) so that their spectra have 
multiple overlapping bands and, thereby, are mutually 
dependent. Then a sample of 7000 triples of three- 
component mixtures was constructed by choosing spec- 
tra randomly from the pool and applying random mix- 
ing matrices A 36]. For each decomposition the Amari 
performance index was computed. Fig. El compares its 
distributions for several different ICA algorithms includ- 
ing FastlCAQ, RADICAL 12] and Nonnegative PCA 
(NNPCA) HJ. The latter uses the fact that pure spec- 
tra are non-negative and the same should hold for the 
estimates, so the nonnegativity is imposed as a soft con- 
straint on the estimates Si{v) in an optimization proce- 
dure. But our simulations showed that this constraint 
is often not fulfilled, and in some cases the output of 
NNPCA (as well as that of other algorithms) is nega- 
tive. To a large part this is due to dependencies be- 
tween the sources. Already prewhitening (i.e. PCA 
and rescaling) sometimes leads to decorrelated compo- 
nents which cannot be made nonnegative by any subse- 
quent rotation. Trying to enforce nonnegativity neglect- 
ing other aspects might then be counterproductive, and 
this might partly explain the relatively poor performance 
of NNPCA (Fig. Eh). 

NNPCA has to be applied to the original spectra, while 
it is well known that using derivatives of spectroscopic 
signals with respect to frequency can improve the results 
(see, e.g., |23,|31|). Taking such derivatives extracts the 
spectral information which is more independent between 
the sources Q- In our numerical experiments, second 
order derivatives approximated by finite differences 

— Y- ~x(i/ i _i)-2x(i/ i )+x(i/ i+ i). 21 

gave best performance |38| . This is clearly seen in the ex- 
ample of FastICA (compare distributions (b) and (c) on 
Fig. Ell ■ But MILCA (e) and RADICAL (d) with second 
derivative data perform better than FastICA (c), and 
are almost equally good when compared to each other. 
Furthermore, our numerical results confirmed that non- 
negativity is satisfied whenever the decomposition is suc- 
cessful (Amari index below 0.05) (see also the discussion 
in (3^). But whether this is fulfilled depends primarily 
on the dependencies between the original signals, and less 
on the algorithm employed. 

A more detailed study of the potential of MILCA in 
multivariate spectral curve resolution will be given in 
a forthcoming publication |40j which will focus on the 
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FIG. 6: Square roots of variabilities <7ij of I(R(Xi, Xj)) (with 
k = 6) from MILCA output for test problem (C) (Fig. g). 
Elements on the diagonal have been set to zero. 



analysis of experimental mixtures and, in particular, on 
the comparison with recently developed interactive algo- 
rithms such as BTEM p|. 

C. Reliability and Uniqueness of the ICA Output 

Obtaining the most independent components from a 
mixture is only the first part of an ICA analysis. Check- 
ing the actual dependencies between the obtained com- 
ponents should be the next task, although it is most often 
ignored. We have seen that it becomes easy and natural 
with MILCA, which was indeed one of our main moti- 
vations for MILCA. The next task after that is to check 
the reliability, uniqueness, and robustness of the decom- 
position. We have already discussed this in the last sub- 
section for test example (C), but not very systematic. A 
systematic discussion will be given now. 

Recently proposed reliability tests 0, 0, II] are based 
on bootstrap methods or noise injection. We here present 
an alternative procedure which again makes use of the 
fact that MILCA gives reliable estimates of the actual 
(in-) dependencies: We test how much the estimated de- 
pendencies change under re-mixing the outputs. 

In the simplest multivariate signal with n com- 

ponents is an instantaneous linear mixture of n indepen- 
dent sources. This was the model we started with in sub- 
section A. We assume it to apply when (i) all estimated 
pairwisc Mis between all ICA components fall below a 
defined threshold, I(si,Sj) < -D max for all i,j = l,...,n 
and i ^ j, and (ii) the overall MI I(si . . . s n ) is below 
another threshold. Notice that the first criterion alone is 
not sufficient, see the appendix. 

In real-world data, however, we are usually confronted 
with deviations from this simple model. The next sim- 
ple possibility is that some pairwise Mis are still exactly 
zero, but others are not. Let us draw a graph where 
each of the n output channels is represented by a ver- 
tex, and each pair of vertices is connected by an 
edge if I(si,3j) > -D max - This gives a partitioning of 
the set of output components into connected clusters 
Ci, . . . C m with m < n. If, in addition, the MI between 
these clusters, I{C\, . . . C m ), is below another suitably 
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chosen threshold, we consider each cluster to be indepen- 
dent (notice that we do not require all channels within a 
cluster to have a MI above the threshold D max ). This 
is essentially our version of multidimensional ICA 
It uses exactly the same basic MILCA algorithm as de- 
fined above, and is thus much simpler conceptually than 
the 'tree-dependent component analysis' of [3j. Its main 
drawback is that it is not sensitive to the actual strengths 
of the non-zero interdependencies. A better algorithm 
which does take them into account will be discussed in 
Sec.IV. 

In addition to this first step of an ICA output analy- 
sis, we have to test for the uniqueness of the components. 
For this purpose, we check whether the (one- or multi- 
dimensional) sources obtained by the ICA algorithm in- 
deed correspond to distinct minima of the contrast func- 
tion or whether other linear combinations exist which 
show approximately the same overall dependencies. An 
example for the latter case is given by two uncorrelated 
Gaussian signals. They remain independent under rota- 
tion H3. 

A good estimator for the uniqueness of the ICA output 
is the variability of the pairwise MI under remixing, i.e. 
under rotations in the two-dimensional plane: 

(Xij = I{X u Xj) - kj (<j> mia ) for i ^ j (22) 
where the global minimum of I is at <j) = 4>wini an d 

I(X i ,X j ) = - [ ' #4W) (23) 
^ Jo 

(notice that Iij((f>) is periodic in <j) with period 7r/2). For 
unique solutions the MI will change significantly (large 
Uij), but it will stay almost constant for ambiguous out- 
puts (small Uij ) . 

Results for the MILCA output of test problem (C) are 
shown in Fig. El (to aid in the interpretation, the actual 
output signals were shown in Fig. The basic ICA 
model is violated both in the Gaussian noise subspace 
and the sin/cos subspace. In the Gaussian subspace the 
components are independent, but it should be impossible 
to find a unique decomposition. Indeed, 05^ ~ (FigEJ) 
and 75,6 ~ (Fig. |3J). For the dependent components 
{sin/cos subspace) the situation is different. We expect 
to have a = also here, corresponding to the isotropy 
of the distribution in this subspace. But I should be 
much larger than zero, because the two signals are not 
independent. Indeed, we see ~ and 7i j 3> 0. In 
general, it depends on the specific application whether 
one should attribute any meaning to when compo- 
nents i and j are not independent. Finally, we conclude 
from Fig. fright) and Fig. that the channels 3, 4 (au- 
dio signals), and 7 (uniformly distributed noise) are one 
dimensional sources, because they are independent from 
any channel, I^j ~ jj^ ~ In « 0, and are reliable, 
03, i ~ <T4,i ~ 07,1 S> 0. 
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FIG. 7: Unsmoothened estimates of I(<j>) for two randomly 
mixed uniform distributions, corrupted with isotropic Gaus- 
sian measurement noises with different signal-to-noise ratios 
SNR = 00, 13, 7, 4, 1 (from top to bottom), plotted against <p. 
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FIG. 8: Averaged Amari index against the signal-to-noise ra- 
tio. The condition number of the mixing matrices is 6. The 
upper curve (in the SNR range from 7 to 3) is for standard 
MILCA, the lower for n-MILCA. 



D. Noisy Signals 

Because our aim is to apply MILCA to real world data, 
we have to discuss the influence of measurement noise. 
In the literature there exist several algorithms which 
are specially tailored to this problem (see e.g. Ref. pj, 
chapter 15). Typically, in order to obtain optimal per- 
formance, the noise is assumed to satisfy very special 
properties like being additive, uncorrelated, isotropic and 
Gaussian. Below we will present a modified MILCA al- 
gorithm which assumes that we have measurement noise 
with exactly these properties. 

Alternatively, one can take just a standard ICA al- 
gorithm (in our case MILCA as described above), and 
analyze how its output depends on the noise level. In 
the following we will compare both approaches. 

We start with two uniformly distributed variables and 
mixed them with a random 2x2 matrix with a fixed 
condition number. After that, iid Gaussian noises are 
added to each of the two mixtures. The amplitudes in 
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both channels are the same, 



Xi(t) 



?(*)+%(*) 



(24) 



with (r]i(t)rij(t')) — rSijSw- For the case were we do 
not use any information of the measurement noise sig- 
nals Xi(t) are then simply used as input in MILCA. In 
Fig. [7| we show I{4>) for the same mixing matrix but dif- 
ferent signal-to- noise ratios SNR ~ vai(si(t))/r. We see 
that I becomes flatter (the variability with respect to the 
mixing angle decreases) with decreasing SNR 42]. The 
presence of noise leads also to a shift of the minimum. 
Both effects introduce errors in estimating the original 
mixing matrix. The upper curve in Fig. |H1 shows the av- 
eraged Amari index over 100 realizations with different 
noise and mixing matrices. 

To reduce this error we modify MILCA to n-MILCA 
(noisy MILCA). At first we do a 'quasiwhitening' with 
the estimated covariance matrix V = (C x — rl)^ 1 / 2 of 
the pure signals (see, e.g., lj, chapter 15) to decorre- 
late the original sources. As a consequence of this, the 
noise will become now correlated, and with it also the 
entire 'quasiwhitcned' signal. Because of this we should 
not minimize /(</>), since in this way we would introduce 
a bias as seen in Fig. [3 towards wrong values of <fi. In- 
stead we minimize I(4>) + ^log(l — Cij((j)) 2 ) where we 
have subtracted the 'linear' contribution (see Eq.JTJ). In 
Fig. |H1 we show again the averaged Amari index for the 
same realizations as used before. Making use of detailed 
information on the noise clearly improved the results, 
except for very small SNR. The amount by which it im- 
proves depends on the condition number of the mixing 
matrix. For matrices far away from singularity (low con- 
dition number) the quasiwhitening has little effect and 
there is hardly any difference, while for large condition 
numbers the two mixtures are nearly the same and it is 
impossible to obtain good results with either algorithms. 

Finally, before leaving this subsection, let us say a few 
words about outliers. Outliers are just a special case 
of noise. Because our MI estimator is based on the k- 
nearest neig hbor distribution, outliers make less difficul- 
ties Ref.[l^| than e.g. in kurtosis based algorithms. 



E. A Real- World Application 

Finally, let us apply MILCA to a fetal ECG recording 
from the abdomen and thorax of a pregnant woman (8 
electrodes, 500 Hz, 5 seconds). We chose this data set 
because it was analyzed several times with different ICA 
algorithms 0> IE 0> EH an d is available on the web 01 • 

The output components of MILCA are shown in Fig. [5] 
[45| . We used k = 30 neighbors for estimating MI, and 
to obtain the minima of Iij (</>) we fitted with 3 Fourier 
components. The success of the decomposition is already 
seen by visual inspection. Obviously, channels 1-2 are 
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FIG. 9: MILCA output: components after minimizing 
I(Xi . . . Xs) for the heart beat example of Sec. III.E. 



dominated by the heartbeat of the mother, and chan- 
nel 5 by that of the child. Channels 3, 4, and 6 still 
contain heartbeat components (of mother and child, re- 
spectively), but look much more noisy. Channels 7 and 8 
seem to be dominated by noise, but with rather different 
spectral compositions. 

In order to verify this also formally (which would be 
essential in any automatic real-time implementation) we 
first show in Fig. ^| (left panel) the pairwise Mis. We see 
that most Mis are indeed small, except the one between 
the first two components. This indicates again that the 
first two components belong to the same source, namely 
the heart of the mother. But some of the other Mis seem 
to be definitely non-zero, even if they are small. This 
indicates that the decomposition is not perfect, as is also 
seen by closer inspection of Fig. |5] 

Finally, we show in the right panel of Fig. the vari- 
abilities under re-mixing. They confirm our previous 
findings. In contrast to the sine/cosine pair in test exam- 
ple (C), the first two components have non-zero cr, show- 
ing that the distribution in this subspace is not isotropic 
and that one can minimize the interdependence in it by 
a suitably chosen demixing. Apart from that, the biggest 
values of cr are for channels 1,2, and 5, showing that these 
channels are most reliably and uniquely reconstructed. 
They are just the channels dominated most strongly by 
heart beat. 



IV. CLUSTER ANALYSIS 

We pointed already out that the usual assumption of 
independent one-dimensional sources as in Ea. H15l) is of- 
ten unrealistic. Take e.g. the ECG discussed in the pre- 
vious subsection, and assume that both hearts - the one 
of the mother and the one of the fetus - are indepen- 
dent chaotic dynamical systems. A chaotic system with 
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FIG. 10: Left panel: I between all the pairwise combinations 
of the signals shown in Fig. [§] Right panel: Square roots of 
variabilities a%j of Iij ((/>). In both panels the values on the 
diagonal are set to zero. 
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FIG. 11: Dendrogram for Fig. [5] The height of each cluster 
(ij) corresponds to I(Xi,Xj) (fc=6). 



continuous time must have at least 3 excited degrees of 
freedom |4^. With any generic placement of the elec- 
trodes, we should then expect to pick up > 3 different 
components from each heart. These components must 
be strongly dependent on each other, even after having 
been whitened [40| . Thus each heart must contribute to 
at least 3 output components in any linear ICA scheme. 
For the mother heart we have indeed found 2 compo- 
nents. The fact that we have not clearly identified more 
dependent components in the output should be consid- 
ered as failure of the instantaneous linear algorithm and 
will be dealt with more systematically in Sec. V. 

In any case, in view of this we have to expect that 
outputs in real-world applications are not independent 
but come in connected clusters. Moreover, we should ex- 
pect that even within one cluster there are more or less 
strongly connected substructures. We have already dis- 
cussed in Sec. III.c a simple way how to identify these 
clusters. In the present section we present a more sys- 
tematic analysis. 

Our strategy is to estimate a proximity matrix from the 
Mis, and then to use a hierarchical clustering algorithm 
to obtain a dendrogram. No thresholds are used in con- 
structing the dendrogram, i.e. it is constructed without 
making any decision about which MILCA output chan- 



nels are independent or not. Only after its construction 
we decide, usually on heuristic reasons and based on ar- 
guments of practicality and usefulness, which channels 
are actually grouped together. This is more convenient, 
usually, than the algorithms of 0, 0, |5(| where this de- 
cision stands at the starting point of the algorithm or is 
an essential part of it. 

A first technical problem concerns the choice of the 
proximity matrix. One might be tempted to use MI di- 
rectly. But we want to include the possibility that some 
of the channels to be grouped together are already multi- 
dimensional by themselves. In this case, using MI would 
introduce a bias: multivariate channels not only tend to 
carry more information than univariate ones, they also 
will have larger Mis. Therefore we propose to use as a 
similarity measure [bH 



Pi ■ = ~ 
10 dim(J 



I(Si,Sj) 



dim(sj)' 



(25) 



where dim(cc) is the dimension of the variable x, i.e. the 
number of its components. 

In most cluster algorithms the proximity matrix P is 
used only for the first step. In the subsequent steps, 
proximities for clusters are derived from it in some re- 
cursive way |5lj| . In the present paper we propose to 
use 'Mi-based Clustering' (MIC) [lj| which is based on 
the grouping property Eq.@. Thus, a cluster of output 
channels is just characterized by the multivariate signal 
formed by the tuple of its individual channels, and the 
proximity measure is still given precisely by Ea. i|25|) at 
each level of the hierarchy. 

In summary, our cluster algorithm is as follows. We 
start with n (usually univariate) MILCA output chan- 



nels Si 



1, . . .n, and we compute Pu according to 



Ea. (|25|) . After that, we enter the following recursion: 

1. Find the pair with minimum distance in the matrix, 
say clusters i and j; 

2. Combine the clusters i and j to a new cluster (ij) 
with multivariate data §ij, and attribute to it a 
height I(si,Sj) in the dendrogram. Thereby the 
total number of clusters is reduced by one, n <— 

3. If the new value of n is 1 then exit; else 

4. update the proximity matrix P^ and go to 1. 

The dendrogram obtained in this way for the ECG data 
of Sec. III.E is shown in Fig.^] In this figure two clusters 
are clearly distinguishable, the mother cluster containing 
channels (1, 2, 3, 4, 7) and the fetus cluster formed by 
channels 5 and 6. This agrees perfectly with the inter- 
pretation given in Sec. III.E. One can of course debate 
whether, e.g., channel 7 belongs to the mother cluster or 
not, but this can be decided as it seems most convenient, 
and it will in general have little effect on any conclusions. 
One way to make use of such a clustering is in cleaning 
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FIG. 12: Left panel: scatter plot of the two Gaussian sources 
with different spectra. Right panel: Output of the modified 
MILCA algorithm (r = 1 and m = 2), where white Gaussian 
is on top and red Gaussian is on bottom. 
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FIG. 13: Change of I under rotation, for the Gaussian model 
shown in Fig. 1121 The nearly horizontal curve shows the be- 
havior without, the sinusoidal one the result with using delay 
embedding. Here the actual mixing angle is 0. 



the data and separating the individual sources. For that, 
one prunes everything except the wanted cluster, and re- 
constructs the original channels by applying the inverse 
of the matrix W. Results obtained in this way will be 
shown in the next section, after having discussed how to 
take into account temporal structures. 

V. USING TEMPORAL STRUCTURES 

A. Instantaneous Demixing that Minimizes 
Delayed Mutual Informations 

Until now we have not used any time structure in 
the signals. In the following we shall assume the sig- 
nals to be stationary with finite autocorrelation times. 
ICA-algorithms in the literature either use no time infor- 
mation at all (JADE[i3, FastlCAQ, INFOMAXjEI, ...) 
or, if they do use it, they use only second order statistics 
(AMUSE [13, TDSEP @,...). The first group is not able 
to decompose two Gaussian signals with different spec- 
tra, while the second group is not able to separate two 
temporally white signals with different amplitude distri- 
butions. Obviously one has to make use of time structure 
and higher order statistics, to obtain optimal results in 
general 0, 01 • This is precisely what we will do in this 
subsection. 




FIG. 14: Test problem (B) of Sec. Ill, consisting of 5 input 
channels (compare with Figl^J. Algorithm "MILCA*" now 
refers to the minimization of Eq. l28ll . The gray bars on the 
right panel show the full MI given in Eg. 11281 . The embedding 
parameters are m = 2, r = 1. 



Normally the first step in nonlinear time series anal- 
ysis of univariate signals is delay embedding (47]: One 
constructs a formally m-variate signal, for any to > 1, 
by simply forming m-dimensional 'delay vectors' with a 
suitably chosen delay r, 

x(t) = [x(t - r), x(t - 2t), ...x(t- tot)] t . (26) 

Thus one characterizes the "state" of a signal at time 
t by giving not its value at t itself, but at to previous 
times. This makes of course sense only when there is 
any time structure in the signal. Similarly we can also 
embed multivariate signals. For n measured channels one 
obtains thereby an n x to 'delay matrix' 

X(t) = [xi(t),...x„(t)]. (27) 

To decompose an instantaneous linear mixture of n 
signals with either non-Gaussian statistics or with non- 
trivial time structure, we propose to simply minimize the 
MI, 

I(si(t),...s n (t)) = min. (28) 

Notice that we have here considered the delay vectors 
as joint entities, i.e. we do not include in Ea. l|28|) the 
Mis between the different delays of the same Xi. More 
explicitly |55j . 

7(xi(i), . . .x„(t)) = I(xi(t - r), . . . xi(t - tot), 
x 2 (t — t), . . . x 2 (t - mr), . . . 
x n {t - t), . . . x n (t - mr)) (29) 

71 

- y^J(Xj(t - T), . . .Xj(t - tot)) 

i=l 
n 

= £ff(x i (t))-JT(xi(t),...x n (t)) 

i=l 

To minimize this, we proceed again as in Sec. Ill, 
i.e. we decompose the rotation needed to minimize 
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FIG. 15: Upper panel: Two channels of the ECG of a preg- 
nant woman. Lower panel: MILCA output from these two 
channels. 



i~(xi (t), . . . x n (t)) into rotations within each of the n(n — 
l)/2 coordinate planes. Each of the latter rotations still 
involves rotations of m delay coordinate pairs, but this 
can be further decomposed into m rotations where only 
one delay coordinate pair is rotated. We thereby obtain 

/(... )-/(... XiW.-.-x^t)...) 
= lM(t),x' j (t))-I(x i (t),x j (t)) 
,Xi(t — mr)) 
. Xj(t — mr)) 
■x'i(t - mr)) 
. xAt — mr)) 



= I{x t {t - t), . 

+I(x j (t-T),. 
-J(xJ(t-T),. 
-Hx'^t-T),. 



-mllix'^^'At)) - I( Xi (t), Xl (t))} , 



(30) 



where we have used in the last term the fact that 
I(x' i (t), x'j(t)) is independent oft due to stationarity. If 
m = 2, this is again a sum of pairwise Mis. If m > 2, we 
have to estimate m-dimensional Mis directly. 

To illustrate this on a simple example, let us assume 
two channels where x\(t) and x%{b) are instantaneous 
mixtures of two Gaussian signals with the same ampli- 
tude distribution but with different spectra: x\ is white 
(iid), while X2 is red and was obtained by filtering with 
a Butterworth filter of order 6 and with cutoff frequency 
0.3. For simplicity we assume the mixing to be a pure ro- 
tation. Then a scatter plot of the vectors (x\(i) , X2{t)) is 
completely featureless, see Fig. Ir27 reftl. and will not al- 
low a unique decomposition. But using delay embedding 
with m = 2 is sufficient to obtain the original sources 
(Fig. El right panel) . 

Similarly good results were obtained with the less triv- 
ial examples of previous sections. In particular, we tested 
the algorithm on test problem (B) of Sec. III.B (Fig. ll4fl . 
The performance of MILCA is improved substantially, 
even with m = 2. The delayed MI (Eq.®) which make 
use of the time structure serves as a better performance 
value (Fig. d (right)). Now JADE and FastICA are 
also heavily punished for not separating one audio signal 
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FIG. 16: MILCA output from the delay embedded two chan- 
nel ECG with embedding dimension m = 3. 



from Gaussian noise (as one can see the MI for TDSEP 
is nearly unchanged because the time correlation in the 
output is minimal). 



B. Demixing with Delays 

The most general linear demixing ansatz for a station- 
ary system assumes superpositions of the observed signals 
with delays. Using up to m delays r, 2r, . . . mr, we thus 
make the ansatz (see e.g. Ref. [l|, chapter 19) 

N m 

= EW ^-(t-fcr) 

3=1 k=l 



^W,jXj(t) 
3=1 



(31) 



where Xj*(t) is a delay vector as defined in Eq. l|2t)|) and 



b"ij ■ ■ ■ w ij] 



(32) 



Since we have now linear superpositions ofnxm mea- 
surements Xj (t — fcr) on the right hand side, we can also 
determine the same number of §i(t) for each value of t, 
i.e. the index i in Eq. l|31|l runs from 1 to nm. 

This ansatz is obviously more appropriate than instan- 
taneous mixing, if the signals Xi(t) are themselves su- 
perpositions of delayed sources. If they involve a finite 
number of delays, 



= Y^J2 a i3 S 3it- kT ), 
3 fc=l 



(33) 



Ecj. (I31|) with finite m would not give the exact demixing, 
since inverting Eq.(j33J) would require an infinite number 
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of delay terms. Also, Eq.JjHJ in general does not corre- 
spond to the inverse of Eq. , because its solutions are 
in general not components of any delay vectors. But it 
should definitely be a better ansatz than the instanta- 
neous Eq.JTSJl. 

Apart from that, we would anyhow not expect Eq. 
to be the correct model in most applications. The main 
reason why we believe that Eq. I|31|> is useful in many ap- 
plications is that it can cope much better with the situa- 
tion discussed at the beginning of Sec. IV. Assume for the 
moment that there is a single source. Different sensors 
(as, e.g., different ECG contacts) typically see different 
projections of this source, and the signals Xi(t) can there- 
fore be considered as different coordinates describing its 
dynamics. As pointed out by Takens delayed values 
of one single signal can also be considered as different 
coordinates. Our demixing ansatz basically reflects the 
hope that suitable superpositions of delayed values of Xj, 
say, can mimic any other signal Xj . 

To illustrate this, we consider again the above ECG 
recording. We assume for the moment that only the two 
channels with the most pronounced fetus heartbeat are 
available and try to decompose them into mother and fe- 
tus heartbeat. These two channels are shown in Fig. 1151 
(top). They are still dominated by the mother heartbeat. 
But the R peak of the mother has a very different shape 
in both channels: In the lower trace it is mainly posi- 
tive, while it has both positive and negative components 
in the upper. It is therefore clear that there cannot ex- 
ist an instantaneous superposition to which the mother's 
heartbeat does not contribute. Instantaneous ICA must 
fail for this case, as is indeed seen in the lower two traces 
of Fig. CHI 

In order to obtain the least dependent components ob- 
tainable with Ea. (|31[) . we minimize again the MI. But 
now, in contrast to the previous subsection, the output 
variable Si(t) are not delay coordinates of any sources, 
and therefore we must minimize the full MI between all 

Si(t), 



I(sx(t),. 



t (i)) = min. 



(34) 



The minimization is done again, as in all previous 
cases, by performing successive transformations in 2- 
dimensional subspaces and by using Ea. HlOjl . In terms 
of the actual algorithm, the only difference to the pre- 
vious subsection is that we now make rotations in all 
subspaces. 

In our application to the fetal ECG we use embed- 
ding dimension m = 3 and the smallest possible delay, 
t = 1/500 s _1 . Results for the two channels shown in 
Fig. ^] are now shown in Fig. ^] The separation is 
now improved. Although we still have one output chan- 
nel where mother and fetus are strongly mixed (channel 
#4), channel #6 is now practically pure fetal heartbeat. 

Finally, we applied this method to all 8 channels of 
the ECG. Using again m = 3 gives altogether 24 output 
channels. They are shown in fFig. H7|l . and we can clearly 
see which ones are dominated by the mother heartbeat, 
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FIG. 17: MILCA output from the embedded eight channel 
ECG (k=100, m=3) 



FIG. 18: Dendrogram for Fig. 1171 Heights of each cluster 
correspond to I(Xi,Xj) of the cluster ij (k=3). 



which by the fetus, and which by noise. In order to do 
this more objectively, we again apply the cluster algo- 
rithm of Sec. IV, with the result shown in Fig.^3] There 
one can clearly see two big clusters corresponding to the 
mother and to the fetus. There are also some small clus- 
ters which should be considered as noise. 

For any two clusters (tuples) X = X\ . . . X p and 
Y = Yi . . . Y q one has I(X 1: . . . Y q ) > I{X) + I(Y). This 
guarantees, if the MI is estimated correctly, that the tree 
is drawn properly, i.e. each parent node is above the two 
daughter nods. The two slight glitches (when clusters (1 
- 14) and (15 - 18) join, and when (21 -22) is joined with 
23) result from small errors in estimating MI. They do 
not affect our conclusions. 

In Fig. ^3 we show the matrices of pairwise Mis (left 
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panel) and of pairwise variabilities (right). They are 
as expected, and they show much more pronounced 
structures than the matrices without delay embedding 
(Fig. I1U|) . For the Mis one can see a clear block struc- 
ture, i.e. the mother and fetus components are now in- 
deed more independent, as suggested also from the traces 
themselves. From the right panel we see that the main 
mother channels (1-4) and the fetus channels (7-8) are 
very stable. The rest is mostly noise, and is not stable as 
indicated by the very small variabilities. 

The final result of MILCA is obtained by pruning ev- 
erything not belonging to the cluster of interest, 



i 6 cluster C 
else 



(35) 



and performing the back transformation. At this stage 
there arises the problem that the reconstructed signals 



x j>k (t; C) = W J, V r a<(t) , W 4 ,y, fc ) 
are in general not delay vectors, i.e. 

Xj,k+i(t; C) ^ xj t k(t - r; C) . 



(36) 



(37) 



In view of this, one has to make some heuristic decision 
what to use as a cleaned signal. We use simple averages, 



Xj (t; C) = — V £ jtk (t + kr; C) 
m f-^ 



(38) 



k=l 



We do not show all 8 full traces for the mother and fetus, 
because this would not be very informative: The results 
are too clean to be judged on this scale. Instead we show 
in Fig. [501 blow-ups of one of the original traces and the 
contributions to it from the mother and from the fetus. 
The separation is practically perfect. 

Before leaving this section, we should point out that 
one can, in principle, also construct algorithms in be- 
tween those of the last two subsections. In subsection A 
we had used delays to minimize the lagged MI, but we 
had not used the delays in the demixing. In the present 
subsection, we have used the same delays both for min- 
imizing MI and for demixing. A generalization consists 
in using m delays in the demixing, but minimizing the 
MI with additional m! delays. Thus we make the same 
demixing ansatz Ea. (|31|l as above, but we minimize 



I{s 1 {t), . ..s nm (t)) = min. 



(39) 



where we have used the definition of I(s\(t), . . .) given in 
Eq.JSUJ), and = [si(t-r), Si(t-2r), . . . Si(t - m' t)] t . 
Up to now, we have not yet applied this to any problem. 



VI. DISCUSSION 

There is by now a huge literature on independent com- 
ponent analysis. Therefore, most of our treatment is re- 
lated in some form to previous work. One of our basic 
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FIG. 19: Left panel: Pairwise Mis between the estimated 
components shown in Fig. 1171 Right: Square roots of variabil- 
ities <Jij of I(Xi,Xj) (with k = 6). Elements on the diagonal 
have been set to zero. 
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FIG. 20: Short segment from the original ECG (a), of the 
mother and fetus contributions estimated without delay em- 
bedding (b,c), and of the two contributions estimated with 
delay embedding (d,e). 



premises was that we did not so much care about speed, 
but we wanted as precise a dependency measure as possi- 
ble. Our claim that this is provided in principle by MI is 
of course not new. But we believe that our estimator via 
/c-nearest neighbor statistics is new and provides the most 
precise mutual information estimate. It is closely related 
to similar estimators for differential entropies which had 
been used in |lll Il2| , and the quality of our results in 
the most simple 2-d blind source separation problem is 
very similar to the one in |12| . The main virtue of our 
MI estimator, compared to all previous MI estimators, 
is the numerical fact that it becomes unbiased when the 
two distributions are independent. 

While using differential entropies instead of MI would 
give the same quality and somewhat simpler codes for 
the basic blind separation problem, using MI has other 
advantages: with it we can estimate the residual de- 
pendencies between the output components. Our use of 
this knowledge for estimating the output uniqueness and 
robustness, by measuring how the dependencies change 
under re-mixing, seems to be new. Previous authors 
used for this problem resamplings and/or noise addition 



In addition to this, we used the Mis between the out- 
puts to cluster them, and we then used this clustering to 
obtain the contributions of the individual (multidimen- 
sional) sources to the measured signals. The observation 
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that "independent" component analysis will in general, 
when applied to real world data, not give independent 
components is not new either 3, S 0- We stress it 
by calling our approach a "least dependent" component 
analysis. Our detailed implementation of this idea seems 
to be new, not the least because our clustering algorithm 
is novel and uses a specific property of MI not shared by 
other contrast functions. 

Although the extension of our algorithm to data with 
time structure discussed in Sec.V.A seems straightfor- 
ward, this strategy of combining in the contrast function 
deviations from Gaussianity both at equal times and at 
non-equal times has been considered in very few papers 
only |2, |Ei|- The present paper is the first which uses 
directly MI for combining these two aspects. In Sec. V. 
was shown that this can substantially improve the sepa- 
ration e.g. of audio signals. 

Both the ansatz of Sec.V.A and the method of demix- 
ing with delays in Sec.V.B arc entirely based on MI, and 
use essentially the same algorithm. Therefore, also the 
generalization mentioned at the end of Sec.V.B uses es- 
sentially the same basic algorithm. This last general- 
ization was never considered before, but demixing with 
delays is of course a very widely treated concept (see e.g. 
0). It is usually called 'convolutive mixing'. In our 
presentation we stressed several features which are typi- 
cally overlooked. One is that the 'convolutive' demixing 
ansatz Ea. (|31|l is in general, when the sources Si(t) are 
not strictly independent, not equivalent to a convolutive 
mixing ansatz, because the sources then will not be com- 
ponents of delay vectors. This is also the reason why we 
avoided the term 'convolutive mixing'. 

Just as ICA may be considered as a generalization 
of principle component analysis (PCA) to non-Gaussian 
contrast functions, mixing with delays is a generalization 
of multivariate singular source analysis (SSA) |57l IHsj 
to include non-Gaussianity. Univariate SSA, see e.g. 
[4^. IH^ , is often considered as an alternative to Fourier 
decomposition and has found many applications, while 
multivariate SSA was mainly used in geophysics. Indeed, 
we consider blind source separation algorithms based on 
temporal second order statistics (AMUSE, TDSEP) as 
more closely related to multivariate SSA than to other 
ICA methods based on nonlinear contrast functions. 

While we discussed also a number of other applications 
and test models, our main test problem was the ECG of 
a pregnant woman, and the task was mainly to extract 
a clean fetal ECG. We have chosen this partly because 
this ECG was already used in previous ICA analyses 
H H E3|. We believe that our method clearly outper- 
formed these and gives nearly perfect results, although 
we should admit that the signals to start with were al- 
ready exceptionally clean. It would be of interest to see 
how our method performs on more noisy (and thus more 
typical) ECGs. Obtaining fetal ECGs should be of con- 
siderable clinical interest, although it is not practiced at 
presence, mainly because of the formidable difficulties to 
extract them with previous methods. In this respect we 



should mention the seminal work of 

HI 

l6l) where fetal 

ECGs were extracted even from univariate signals using 
locally nonlinear methods. It would be interesting to see 
how our method compares with such a nonlinear method 
when the latter is used for multivariate signals. 

Throughout the paper, we used total MI as a contrast 
function. One might a priori think that the sum of all 
pairwise Mis would be easier to estimate, and could be 
as useful as the total MI. Neither is true. One reason for 
the efficiency of our algorithm is that changes of the total 
MI under linear remixings can be estimated by comput- 
ing only pairwise Mis (except for the method of Sec. V.A 
with embedding dimension m > 2). Thus one needs to 
compute the full high-dimensional Ml only once. For 
all changes during the minimization, computing pairwise 
Mis is sufficient. But this does not mean that total MI 
is essentially a sum of pairwise Mis. We showed in the 
appendix that this can be very wrong. And we found 
in more realistic applications that the sum over all pair- 
wise Mis sometimes increases when we minimize total 
ML Therefore we consider the sum over all pairwise Mis 
as a very bad contrast function. 

This is somewhat surprising if one considers ICA as a 
generalization of PCA. PCA can be viewed as minimal- 
ization of the sum over all squared pairwise covariances. 
But we believe that this close relation between ICA and 
PCA is somewhat misleading anyhow. It is usually based 
on this analogy that the data are first pre-whitened, be- 
fore the ICA analysis proper is made, which is then re- 
stricted to pure rotations. We showed by means of a 
counter example that this can lead to a solution which 
does not have minimal MI. This was a rather artificial 
example, and the problem might not be serious in prac- 
tice (all our results were obtained, for simplicity, with 
prewhitening). But one should keep it in mind in future 
applications. 

Finally we should point out that Eqs. (|9I1U|) hold for 
the exact MI, but are only approximately true for our es- 
timators. Therefore, working directly on higher dimen- 
sional Mis, without breaking their changes down to 2- 
dimensional contributions, can give slightly different re- 
sults. We found no big systematic trends, although we 
expect in general that estimates using the smallest di- 
mensions are most reliable. The reason is that they are 
based on smaller distances for fixed k, or use larger k 
when using the same distances. The first reduces sys- 
tematic errors, the second statistical ones. The decrease 
in CPU time when using Eq. l|l(Jf) to decrease the effective 
dimensionality is a further important point. 



VII. CONCLUSION 

In the first part of the paper we discussed the classical 
linear instantaneous ICA model and introduced a new 
algorithm which shows better results than conventional 
ICA-algorithms. Our algorithm should be particularly 
useful for real world data, since it works with actual de- 
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pendencies between reconstructed sources (as measured 
by mutual informations), and thus easily allows to study 
the question how independent and unique are the found 
components. 

In the following we discussed the case where outputs 
can be grouped together for a meaningful interpretation. 
We again saw that MI has some properties which makes 
it the ideal contrast function, also for this purpose. 

Finally, when we included time domain structures, we 
again could use the same estimates of MI, with basically 
the same algorithms. This - and the excellent results 
when applied to a fetal electrocardiogram - suggest that 
our method of basing independent component analysis 
systematically on highly precise estimates of MI is very 
promising. It is true that our method is slower than ex- 
isting algorithms like FastICA or JADE, but we believe 
that the improved results justify this effort in many situ- 
ations, in particular in view of the ever-increasing power 
of digital computers. 

The software implementation of the MILCA algorithm 
is freely available online |rj3 ]. 

Appendix 

In this appendix we give two counter examples showing 
somewhat counterintuitive features of the MI. In the first 
example we have two continuous variables, and the joint 
density is constant in an L-shaped domain 

D = {[0,! I ]x[0,e]U[0, E ]x[0,y}. (40) 

It is zero outside D. It is easily seen that I(X, Y) — > h 
in the limit e — > 0, with h = plogp + (1 — p) log(l — p) 



and with p = l x /(l x + l y ). In this limit, the marginal 
distributions are superpositions of a delta peak at x or y 
equal to zero, and a uniform distribution on [0, 1]. The 
components have relative weights l x : l y . The only infor- 
mation about y learned by fixing x is on which arm the 
pair (x,y) is located, and for this h bits are sufficient. 



On the other hand, any linear transformation applied 
to the (x, y)-plane would give an L-shaped figure with at 
least one oblique arm. For such a distribution knowing x 
would specify y with an accuracy ~ e, and thus I(X 7 Y) ~ 
— loge — > oo for e — ► 0. But the covariance between 
X and Y is not zero, hence the minimal MI is reached 
(for small e) when the correlation coefficient r is non- 
zero. A more detailed analysis shows that I(X, Y) of the 
distribution rotated by an angle (f> is not symmetric under 
<t> — * —<f>, if lx 7^ ly 



The second example is one of three random variables 
X, Y , and Z which are pairwise strictly independent, but 
globally dependent. For simplicity, the example uses dis- 
crete and indeed binary variables. We have thus 8 prob- 
abilities p(x,y,z) for each variable being either or 1, 
and we chose them as p(0, 0, 0) = p(l, 1, 0) = p(0, 1, 1) = 
p(l,0,l) = l/8+eandp(0,0,l) = p(0,l,0) = p(l,0,0) = 
p(l, 1, 1) = 1/8 — e. For this choice all pairwise probabil- 
ities are 1 /4, but I(X, Y, Z) ^ 0. 
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